Image Captioning

Files Submitted

Criteria | Meets Specification

Submission Files

The submission includes model.py and the following Jupyter notebooks, where all questions have been answered:
  • 2_Training.ipynb
  • 3_Inference.ipynb

model.py

Criteria | Meets Specification

CNNEncoder

The chosen CNN architecture in the CNNEncoder class in model.py makes sense as an encoder for the image captioning task.
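For reference, a minimal sketch of an encoder that would satisfy this criterion, assuming a pre-trained ResNet-50 backbone whose classification head is replaced by a trainable embedding layer (the backbone choice and `embed_size` are illustrative, not mandated by the rubric):

```python
import torch.nn as nn
import torchvision.models as models


class CNNEncoder(nn.Module):
    """Encode an image into a fixed-length feature vector."""

    def __init__(self, embed_size):
        super().__init__()
        resnet = models.resnet50(pretrained=True)    # pre-trained backbone (illustrative choice)
        for param in resnet.parameters():
            param.requires_grad_(False)              # freeze the convolutional layers
        modules = list(resnet.children())[:-1]       # drop the ImageNet classification head
        self.resnet = nn.Sequential(*modules)
        self.embed = nn.Linear(resnet.fc.in_features, embed_size)  # new, trainable embedding layer

    def forward(self, images):
        features = self.resnet(images)               # (batch, 2048, 1, 1)
        features = features.view(features.size(0), -1)
        return self.embed(features)                  # (batch, embed_size)
```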

RNNDecoder

The chosen RNN architecture in the RNNDecoder class in model.py makes sense as a decoder for the image captioning task.
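A matching decoder sketch, assuming a single-layer LSTM that receives the image embedding as its first input step (the layer count and sizes are illustrative):

```python
import torch
import torch.nn as nn


class RNNDecoder(nn.Module):
    """Generate a caption, one token at a time, from encoded image features."""

    def __init__(self, embed_size, hidden_size, vocab_size, num_layers=1):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, features, captions):
        # Drop the <end> token; the image features serve as the first input step.
        embeddings = self.embed(captions[:, :-1])
        inputs = torch.cat((features.unsqueeze(1), embeddings), dim=1)
        hiddens, _ = self.lstm(inputs)
        return self.fc(hiddens)                      # (batch, seq_len, vocab_size)
```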

2_Training.ipynb

Criteria | Meets Specification

Using the Data Loader

When using the get_loader function in data_loader.py to train the model, most arguments are left at their default values, as outlined in Step 1 of 1_Preliminaries.ipynb. In particular, the submission only (optionally) changes the values of the following arguments: transform, mode, batch_size, vocab_threshold, vocab_from_file.
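A hedged example of what such a call might look like, changing only the five arguments named above (the batch_size and vocab_threshold values are illustrative, and the transform here is a placeholder; a ResNet-appropriate transform is discussed under Question 2):

```python
from torchvision import transforms
from data_loader import get_loader

# Placeholder transform; see Question 2 for a choice congruent with the encoder.
transform_train = transforms.Compose([transforms.Resize((224, 224)),
                                      transforms.ToTensor()])

# Only the arguments named in the rubric are changed; everything else keeps
# its default value from 1_Preliminaries.ipynb.
data_loader = get_loader(transform=transform_train,
                         mode='train',
                         batch_size=64,         # illustrative value
                         vocab_threshold=5,     # illustrative value
                         vocab_from_file=False)
```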

Step 1, Question 1

The submission describes the chosen CNN-RNN architecture and details how the hyperparameters were selected.

Step 1, Question 2

The transform is congruent with the choice of CNN architecture. If the transform has been modified, the submission describes how the transform used to pre-process the training images was selected.
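One transform that would be congruent with a pre-trained ResNet encoder, assuming 224x224 inputs normalized with ImageNet statistics (the specific augmentations are a design choice, not a requirement):

```python
from torchvision import transforms

# 224x224 crops normalized with ImageNet statistics, plus light augmentation.
transform_train = transforms.Compose([
    transforms.Resize(256),
    transforms.RandomCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize((0.485, 0.456, 0.406),
                         (0.229, 0.224, 0.225)),
])
```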

Step 1, Question 3

The submission describes how the trainable parameters were selected and makes a well-informed choice about which parameters in the model should be trainable.
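A sketch of one well-informed choice, assuming the frozen-backbone encoder sketched above (so `encoder.embed` is its only new layer); the sizes are illustrative:

```python
from model import CNNEncoder, RNNDecoder

embed_size, hidden_size, vocab_size = 256, 512, 9000   # illustrative sizes
encoder = CNNEncoder(embed_size)
decoder = RNNDecoder(embed_size, hidden_size, vocab_size)

# The pre-trained ResNet weights stay frozen; only the decoder and the
# encoder's new embedding layer are updated during training.
params = list(decoder.parameters()) + list(encoder.embed.parameters())
```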

Step 1, Question 4

The submission describes how the optimizer was selected.
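For instance, Adam is a common and easily justified choice for this task; `params` is the list of trainable parameters chosen in Question 3:

```python
import torch.optim as optim

# Adam with its default learning rate; the value is illustrative.
optimizer = optim.Adam(params, lr=0.001)
```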

Step 2

The code cell in Step 2 contains all of the code used to train the model from scratch, and its output shows exactly what was printed when it ran. If the training code has been amended, it is well-organized and includes comments.
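A condensed sketch of what this cell typically contains, building on the encoder, decoder, optimizer, and data_loader from the snippets above and assuming the loader yields (images, captions) batches; the epoch count and logging interval are illustrative:

```python
import torch
import torch.nn as nn

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
encoder.to(device)
decoder.to(device)
criterion = nn.CrossEntropyLoss()

num_epochs = 3                                       # illustrative
for epoch in range(1, num_epochs + 1):
    for i_step, (images, captions) in enumerate(data_loader, start=1):
        images, captions = images.to(device), captions.to(device)

        optimizer.zero_grad()
        features = encoder(images)
        outputs = decoder(features, captions)

        # Flatten (batch, seq_len, vocab) and (batch, seq_len) for the loss.
        loss = criterion(outputs.view(-1, outputs.size(-1)), captions.view(-1))
        loss.backward()
        optimizer.step()

        if i_step % 100 == 0:
            print(f'Epoch {epoch}, step {i_step}, loss: {loss.item():.4f}')
```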

3_Inference.ipynb

Criteria | Meets Specification

transform_test

The transform used to pre-process the test images is congruent with the choice of CNN architecture. It is also consistent with the transform specified in transform_train in 2_Training.ipynb.
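A sketch of a test transform consistent with the training transform shown earlier: same resizing and normalization, but deterministic (no random augmentation at test time):

```python
from torchvision import transforms

# Same 224x224 input size and ImageNet normalization as transform_train,
# but with a deterministic center crop instead of random augmentation.
transform_test = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize((0.485, 0.456, 0.406),
                         (0.229, 0.224, 0.225)),
])
```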

Step 3

The implementation of the sample method in the RNNDecoder class correctly leverages the RNN to generate predicted token indices.
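A greedy-decoding sketch of what such a sample method might look like inside RNNDecoder; it assumes `inputs` is the encoder output reshaped to (1, 1, embed_size) and that the <end> token has index 1 in the vocabulary (both assumptions, not requirements):

```python
def sample(self, inputs, states=None, max_len=20):
    """Greedy decoding: feed the image features in first, then feed each
    predicted token back into the LSTM until <end> or max_len is reached."""
    predicted = []
    for _ in range(max_len):
        hiddens, states = self.lstm(inputs, states)   # (1, 1, hidden_size)
        scores = self.fc(hiddens.squeeze(1))          # (1, vocab_size)
        token = scores.argmax(dim=1)                  # highest-scoring word index
        predicted.append(token.item())
        if token.item() == 1:                         # assumed index of <end>
            break
        inputs = self.embed(token).unsqueeze(1)       # next LSTM input, (1, 1, embed_size)
    return predicted
```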

Step 4

The clean_sentence function passes the test in Step 4. The sentence is reasonably clean: any <start> and <end> tokens have been removed.
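A minimal sketch; in the notebook the index-to-word mapping comes from the data loader's vocabulary, while here it is passed in explicitly as `idx2word` (a hypothetical argument) so the example stays self-contained:

```python
def clean_sentence(output, idx2word):
    """Turn a list of predicted token indices into a readable caption,
    dropping the <start> and <end> special tokens."""
    words = [idx2word[idx] for idx in output]
    words = [w for w in words if w not in ('<start>', '<end>')]
    return ' '.join(words)
```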

Step 5

The submission shows two image-caption pairs where the model performed well, and two image-caption pairs where the model did not perform well.

Tips to make your project stand out:

  • Use the validation set to guide your search for appropriate hyperparameters.
  • Implement beam search to generate captions on new images (a sketch appears after this list).
  • Tinker with your model, and train it for long enough, to obtain results comparable to (or surpassing!) those reported in recent research articles.
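A compact beam-search sketch that works with the decoder sketched above; the beam width, maximum length, and <end> index are illustrative assumptions, and a production version would usually also length-normalize the scores:

```python
import heapq
import torch


def beam_search(decoder, features, beam_width=5, max_len=20, end_idx=1):
    """Keep the `beam_width` highest-scoring partial captions at every step,
    instead of committing to the single best token as greedy decoding does.

    `features` is the encoder output reshaped to (1, 1, embed_size)."""
    # Each beam entry: (cumulative log-prob, token list, LSTM states, next input)
    beams = [(0.0, [], None, features)]
    completed = []
    for _ in range(max_len):
        candidates = []
        for score, tokens, states, inputs in beams:
            hiddens, new_states = decoder.lstm(inputs, states)
            log_probs = torch.log_softmax(decoder.fc(hiddens.squeeze(1)), dim=1)
            top_probs, top_idx = log_probs.topk(beam_width, dim=1)
            for p, idx in zip(top_probs[0], top_idx[0]):
                entry = (score + p.item(),
                         tokens + [idx.item()],
                         new_states,
                         decoder.embed(idx.unsqueeze(0)).unsqueeze(1))
                if idx.item() == end_idx:
                    completed.append(entry)      # caption finished with <end>
                else:
                    candidates.append(entry)
        if not candidates:
            break
        beams = heapq.nlargest(beam_width, candidates, key=lambda e: e[0])
    completed.extend(beams)
    return max(completed, key=lambda e: e[0])[1]  # token list of the best caption
```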